Introduction

In this project we are trying first to explore our data to get a better understanding To answer our questions we first need to have a preliminary look at our data, so that we can get a better a idea what we are dealing with, as well as the possible missing data and relationships that exist

Preliminary Look at the data

We need first to define the data we have.

Variable Definition Key
survival Survival 0 = No, 1 = yes
pclass ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex sex
age Age in year
sibsp Number of siblings/spouses aboard the titanic
parch Number of parents/children aboard the Titanic
ticket ticket number(unique)
fare Passenger fare
cabin Cabin number
embarked port of embarkation C = Cherbourg, Q = Queens-town, S = Southampton

Loading the packages and the Data

# Loading Packages
## tidyverse loads dplyr and readr
library(tidyverse)

## To have different color maps
library(viridis)

## ggplot2 to produce different plots
library(ggplot2)

## uses ggplot2 to produce a correlation matrix -- the data must be in the correct form
library(ggcorrplot)

## Gives us better themes
library(hrbrthemes)

## to use skewness fun. to calculate skewness of the distribution
library(e1071)

## Multivariate imputation using chained equations -- to impute the missing values in our data
library(mice)

## Loads different statistical functions
library(statsr)

## To produce interactive plot
library(plotly)


# Loading Training Data
train <- read_csv("data/train.csv")

# Loading Testing Data
test <- read_csv("data/test.csv")

# Binding them into a full data frame
df <- bind_rows(train,test)

Exploration Of The Data

Description of the Data

Summary of Data

summary(train)
  PassengerId       Survived          Pclass          Name               Sex                 Age       
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891         Length:891         Min.   : 0.42  
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character   1st Qu.:20.12  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character   Mode  :character   Median :28.00  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                                         Mean   :29.70  
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                                         3rd Qu.:38.00  
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                                         Max.   :80.00  
                                                                                        NA's   :177    
     SibSp           Parch           Ticket               Fare           Cabin          
 Min.   :0.000   Min.   :0.0000   Length:891         Min.   :  0.00   Length:891        
 1st Qu.:0.000   1st Qu.:0.0000   Class :character   1st Qu.:  7.91   Class :character  
 Median :0.000   Median :0.0000   Mode  :character   Median : 14.45   Mode  :character  
 Mean   :0.523   Mean   :0.3816                      Mean   : 32.20                     
 3rd Qu.:1.000   3rd Qu.:0.0000                      3rd Qu.: 31.00                     
 Max.   :8.000   Max.   :6.0000                      Max.   :512.33                     
                                                                                        
   Embarked        
 Length:891        
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Categories of Features

Quantitative data are measures of values or counts and are expressed as numbers.

Qualitative data are measures of ‘types’ and may be represented by a name, symbol, or a number code.

Qualitive

Categorical: Survived, Sex, and Embarked. Ordinal: Pclass. Nominal: Name.

Quantitive

Continuous: Age, Fare. Discrete: SibSp, Parch.

Mix types

Ticket is a mix of numeric and alphanumeric data types Cabin is mix between alpha and numeric

Exploring Missing Data.

Checking for Missing values in each feature

colSums(is.na(train))
PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch 
          0           0           0           0           0         177           0           0 
     Ticket        Fare       Cabin    Embarked 
          0           0         687           2 
missing_values <- train %>% summarize_all(funs(sum(is.na(.))/n()))

missing_values <- gather(missing_values, key="feature", value="missing_pct")

missing_values

missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
  geom_bar(stat="identity",fill="red") +
  coord_flip() + # to flip the graph
  xlab("Features") +
  ylab("Percentage of missing values")+
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum()

Missing data, it is normal?

It is quite normal to see missing data in any data-set as such data is collected by manually which means that there might be some error. Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples.

The solution to the missing data

Removal

The removal of the observations that contain missing might cause a bigger issue where it might an even bigger loss of information which cause bias in our estimation. But on close observation of our data we can identify that there are some features that are not extremely useful and contain missing data, so we can drop such features.

train <- train %>%
  select(!Cabin)
summary(train)
  PassengerId       Survived          Pclass          Name               Sex                 Age       
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891         Length:891         Min.   : 0.42  
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character   1st Qu.:20.12  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character   Mode  :character   Median :28.00  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                                         Mean   :29.70  
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                                         3rd Qu.:38.00  
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                                         Max.   :80.00  
                                                                                        NA's   :177    
     SibSp           Parch           Ticket               Fare          Embarked        
 Min.   :0.000   Min.   :0.0000   Length:891         Min.   :  0.00   Length:891        
 1st Qu.:0.000   1st Qu.:0.0000   Class :character   1st Qu.:  7.91   Class :character  
 Median :0.000   Median :0.0000   Mode  :character   Median : 14.45   Mode  :character  
 Mean   :0.523   Mean   :0.3816                      Mean   : 32.20                     
 3rd Qu.:1.000   3rd Qu.:0.0000                      3rd Qu.: 31.00                     
 Max.   :8.000   Max.   :6.0000                      Max.   :512.33                     
                                                                                        

Imputation

Imputing the missing data gives the advantage that we use a learning model to predict such missing data and maintain the distribution of our data.

Histogram of Age feature.

gAgeDensity <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  annotate(
    "text",
    x = 70,
    y = 0.04,
    label = paste("Skewness:",skewness(train$Age,na.rm = T)),
    colour = inferno(1,begin = 0.1),
    size = 4
  ) + 
  theme_ipsum_rc()

gAgeDensity

Population mean and standard deviation of Age feature Before imputation

train %>%
  summarise(Age_mean = mean(Age,na.rm = T), Age_sd = sd(Age,na.rm = T))

Histogram of Fare feature

gFareDensity <- train %>%
  select(Fare) %>%
  ggplot(aes(Fare, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=viridis(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = viridis(1,begin=0)) + 
  scale_y_continuous(limits = c(0,0.05))+
  theme_ipsum_rc() + 
  annotate(
    "text",
    x = 200,
    y = 0.05,
    label = paste("Skewness",skewness(train$Fare)),
    colour = "black",
    size = 4
  )

gFareDensity 

Population mean and standard deviaiton of Fare feature.

train %>%
  summarise(Fare_mean = mean(Fare,na.rm = T), Fare_sd = sd(Fare,na.rm = T))

Imputing the missing age feature

#---------------MICE--------------
set.seed(129)
mice_mod <- mice(train[,!names(train) %in% c('PassengerId','Name','Ticket','Cabin','Survived')],method = 'rf')

 iter imp variable
  1   1  Age
  1   2  Age
  1   3  Age
  1   4  Age
  1   5  Age
  2   1  Age
  2   2  Age
  2   3  Age
  2   4  Age
  2   5  Age
  3   1  Age
  3   2  Age
  3   3  Age
  3   4  Age
  3   5  Age
  4   1  Age
  4   2  Age
  4   3  Age
  4   4  Age
  4   5  Age
  5   1  Age
  5   2  Age
  5   3  Age
  5   4  Age
  5   5  Age
Warning: Number of logged events: 2
mice_output <- complete(mice_mod)

gdistrOriginalData <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 25,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) +
  ggtitle("Distribution of original data") +
  theme_ipsum_rc()
gdistrMICEData <- mice_output %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 25,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  ggtitle("Distribution of mice output") +
  theme_ipsum_rc()

gridExtra::grid.arrange(gdistrOriginalData,gdistrMICEData,nrow = 1)


train$Age <- mice_output$Age

missing_values <- train %>% summarize_all(funs(sum(is.na(.))/n()))

missing_values <- gather(missing_values, key="feature", value="missing_pct")

missing_values

missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
  geom_bar(stat="identity",fill="red") +
  coord_flip() + # to flip the graph
  xlab("Features") +
  ylab("Percentage of missing values")+
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum()

Population mean and standard deviation

train %>%
  summarise(Age_mean = mean(Age,na.rm = T), Age_sd = sd(Age,na.rm = T))

Determining the distrubutuion of Age and Fare By inspection

Though determination of the distribution using inspection is likely not going to be effective we are going to the KS-test in a later section

Age

The histogram of the Age feature look very much like a normal distribution, yet it’s not a normal distribution itself.

Fare

The histogram of the Fare feature fits the chi-square distribution.

Plotting The Data

We include even more plots of our data to get a better understanding of it, help us see hidden correlations and ways to facilitate our analysis

Correlation Matrix

We are going to use correlation matrix of the numerical data to assess the correlation, which might gives a better idea of which feature might be important

correlationMatrix <- train %>%
  filter(!is.na(Age)) %>%
  select(Survived, Pclass,Age,SibSp,Parch,Fare) %>%
  cor() %>%
  ggcorrplot(lab = T,
             ggtheme =theme_ipsum_rc(grid = F),
             title="Correlation Matrix",hc.order=T,
             colors =rev(viridis(3,alpha=0.7)),
             digits = 1)

correlationMatrix

The fare features seems to be the most correlated feature to survival of the passengers, but it doesn’t negate the importance of the other features in the data. Which means that we will start by comparing the each that we consider to be important against survival feature

Class of Passenger Vs Survived

gPclassSurvived <- train %>%
  select(Pclass,Survived) %>%
  ggplot(aes(as_factor(Pclass),fill=as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum_rc() + 
  labs(x = "Classes",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived"))

gPclassSurvived

From this Plot, it seems clear that people from the upper classes had higher survival rates, thought this seemed obvious from the beginning. ### Siblings and Spouses Vs Survived

gSibSpSurvived <- train %>%
  select(SibSp,Survived) %>%
  ggplot(aes(as_factor(SibSp),fill=as_factor(Survived))) +
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Siblings and Spouses",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum()

gSibSpSurvived

This plot highlight a point that might be counter-intuitive that people that have no siblings or spouses had lower survival rates than those with at least one sibling or a spouse

Number of children/parents Vs Survived

gParchSurvived <- train %>%
  select(Parch,Survived) %>%
  ggplot(aes(as_factor(Parch),fill=as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(label = scales::percent)+
  labs(x = "Number of parents/children",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum_rc()
gParchSurvived

Gender VS Survived

gSexSurvived <- train %>%
  select(Sex,Survived) %>%
  ggplot(aes(as_factor(Sex),fill = as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(label = scales::percent) + 
  labs(x = "Sex",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum_rc()
print(gSexSurvived)

Dashboard of the previous graphs

gridExtra::grid.arrange(gPclassSurvived,
                        gSibSpSurvived,
                        gSexSurvived,
                        gParchSurvived,
                        nrow=2)
train %>%
  group_by(Sex) %>%
  summarise(Age_mean = mean(Age,na.rm=TRUE),
            age_sd = sd(Age,na.rm=T),
            surival_mean = mean(Survived,na.rm =T),
            surival_sd = sd(Survived,na.rm = T))

Sampling

Random sample of size of 50 (Age)

sample_50 <- sample_n(train,size = 50,replace = T) %>%
  summarise(mean = mean(Age), sd = sd(Age))
sample_50

Sampling distrubution with a fixed size (50)

numOfSamples <- seq(50,1000,50)

smplngDstrbtnRpsChng <- tibble()

for(i in numOfSamples){
  
  for(y in 1:i){
    nsample <- sample_n(train,size=50,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtnRpsChng) + 1
    smplngDstrbtnRpsChng[newRow,"reps"] <- i
    smplngDstrbtnRpsChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
  
}

gSamplingReps <- smplngDstrbtnRpsChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~reps,
    type = "histogram"
  )
gSamplingReps

While no. of sample size increase the variability of sampling distribution decrease and the mean increase.

Sampling distrubution with a fixed repetitions and different sizes

sizes <- seq(20,260,20)
smplngDstrbtnSzChng <- tibble()
for(i in sizes){
  for(y in 1:1500){
    nsample <- sample_n(train,size=i,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtnSzChng) + 1
    smplngDstrbtnSzChng[newRow,"sizes"] <- i
    smplngDstrbtnSzChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
}
gSamplingSize <- smplngDstrbtnSzChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~sizes,
    type = "histogram"
  )
gSamplingSize

While no. of sample size increase the variability of sampling distribution decrease,and The sample distribution mean will be normally distributed as long as the sample size is more than 30.

kernel distribution

A kernel distribution is a nonparametric representation of the probability density function (\(pdf\)) of a random variable in any population

The kernel smoothing function defines the shape of the curve used to generate the pdf Kernel distribution is Quote from histogram in other word (smooth representation of a histogram) That the integral =1 There is a benefit of smooth representation of a histogram like Ignores irregularities and outliers , more efficient in approximation so it deals better with large data than small data

\[ \hat{f_h} = \frac{1}{n} = \sum^n_{i = 1} K(x-x_i) = \frac{1}{nh} K\left(\frac{x-x_i}{h}\right) \]

Rules

Non-weighted Data

\[ \hat{f_h} = \frac{1}{n} \sum^n_{i = 1} K(x-x_i) = \frac{1}{nh} K\left(\frac{x-x_i}{h}\right) \]

Weighted Data

\[ \hat{f_h} = \frac{1}{h} \sum^N_{i=1} w_i K \left(\frac{x-x_i}{h}\right), \qquad \text{where} \sum^N_{i= 1} w_i = 1 \]

Kernel Function

  1. Box
  2. Triangle
  3. Normal
  4. Pantechnicon

Each density curve uses the same input data, but applies a different kernel smoothing function to generate the pdf. The density estimates are roughly comparable, but the shape of each curve varies slightly. For example, the box kernel produces a density curve that is less smooth than the others.

The choice of bandwidth value controls the smoothness of the resulting probability density curve (higher value of \(h\) more smoothing)

Specifying a smaller bandwidth produces a very rough curve, but reveals that there might be two major peaks in the data. Specifying a larger bandwidth produces a curve nearly identical to the kernel function Choosing the optimal (\(h\)) bandwidth methods :

  1. Silverman’s rule of thump that computes an optimal h by assuming that data is normally distributed
  2. Improved Sheather Jones (ISJ) an algorithm is more robust with multimodality data or a lot of data (one disadvantage is it needs to large data )

Bounded domains data: have a constrains like data couldn’t be negative ( -ve lead to probability = 0)

Mirror method

  1. Mirror the data
  2. Sum the original and mirrored kernel density estimate
  3. Chop it so that zero at the boundary side

2D

\(h\) : could be matrix (different \(h\) in different directions)

The choice of norm comes into \(d \ge 2\)

The p-norm is \(||x||_p := (\sum_{i=1} |x|^p)^{\frac{1}{p}}\) - norm-p =1 Manhattan distance - norm-p =2 Euclidean norm - norm-p =inf maximum norm (it’s not obvious in every case which norm is the correct one)

standard euclidean distance is good choice because it invariant under rotation as large data choice of k and p isn’t important so

Answering our Questions

Acurrate summary of our data.

Q3) i)

ii)?? iii) Yes, by interval estimation iv)?? Q4) Q5)

gAgeDensity <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  annotate(
    "text",
    x = 70,
    y = 0.04,
    label = paste("Skewness:",skewness(train$Age,na.rm = T)),
    colour = inferno(1,begin = 0.1),
    size = 4
  ) + 
  theme_ipsum_rc()

gAgeDensity

Q6)//?

gFareDensity <- train %>%
  select(Fare) %>%
  ggplot(aes(Fare, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=viridis(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = viridis(1,begin=0)) + 
  scale_y_continuous(limits = c(0,0.05))+
  theme_ipsum_rc() + 
  annotate(
    "text",
    x = 200,
    y = 0.05,
    label = paste("Skewness",skewness(train$Fare)),
    colour = "black",
    size = 4
  )

gFareDensity

Q8) Q)9,10,11

numOfSamples <- c(50,100,1000)
smplngDstrbtonRpsChng <- tibble()
for(i in numOfSamples){
  for(y in 1:i){
    nsample <- sample_n(train,size=50,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtonRpsChng) + 1
    smplngDstrbtonRpsChng[newRow,"reps"] <- i
    smplngDstrbtonRpsChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
}

gSamplingReps <- smplngDstrbtonRpsChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~reps,
    type = "histogram"
  )

gSamplingReps

Q12) Q)13,14,15

Q16) Q17,18)

---
title: "Titanic Data Analysis"
author: 
  - "Ahmed Ashraf"
  - "Omar Gamal Abdelaziz"
  - "Ahmed Yousri"
  - "Ahmed Dawood"
  - "Alia Medhat"
output:
  html_notebook:
    toc: yes
  pdf_document:
    toc: yes
    number_sections: yes
    includes:
      in_header: preamble.tex
    highlight: pygments
always_allow_html: yes
editor_options: 
  markdown: 
    wrap: 72
---

# Introduction

In this project we are trying first to explore our data to get a better
understanding To answer our questions we first need to have a
preliminary look at our data, so that we can get a better a idea what we
are dealing with, as well as the possible missing data and relationships
that exist

## Preliminary Look at the data

We need first to define the data we have.

| Variable | Definition                                    | Key                                             |
|:-------------------|:-------------------------|:-------------------------|
| survival | Survival                                      | 0 = No, 1 = yes                                 |
| pclass   | ticket class                                  | 1 = 1st, 2 = 2nd, 3 = 3rd                       |
| sex      | sex                                           |                                                 |
| age      | Age in year                                   |                                                 |
| sibsp    | Number of siblings/spouses aboard the titanic |                                                 |
| parch    | Number of parents/children aboard the Titanic |                                                 |
| ticket   | ticket number(unique)                         |                                                 |
| fare     | Passenger fare                                |                                                 |
| cabin    | Cabin number                                  |                                                 |
| embarked | port of embarkation                           | C = Cherbourg, Q = Queens-town, S = Southampton |

## Loading the packages and the Data

```{r Loading Packages, echo=TRUE, message=FALSE, results='hide'}
# Loading Packages
## tidyverse loads dplyr and readr
library(tidyverse)

## To have different color maps
library(viridis)

## ggplot2 to produce different plots
library(ggplot2)

## uses ggplot2 to produce a correlation matrix -- the data must be in the correct form
library(ggcorrplot)

## Gives us better themes
library(hrbrthemes)

## to use skewness fun. to calculate skewness of the distribution
library(e1071)

## Multivariate imputation using chained equations -- to impute the missing values in our data
library(mice)

## Loads different statistical functions
library(statsr)

## To produce interactive plot
library(plotly)


# Loading Training Data
train <- read_csv("data/train.csv")

# Loading Testing Data
test <- read_csv("data/test.csv")

# Binding them into a full data frame
df <- bind_rows(train,test)
```

# Exploration Of The Data

## Description of the Data

## Summary of Data

```{r Data Summary}
summary(train)
```

## Categories of Features

> Quantitative data are measures of values or counts and are expressed
> as numbers.

> Qualitative data are measures of 'types' and may be represented by a
> name, symbol, or a number code.

### Qualitive

Categorical: Survived, Sex, and Embarked. Ordinal: Pclass. Nominal:
Name.

### Quantitive

Continuous: Age, Fare. Discrete: SibSp, Parch.

### Mix types

Ticket is a mix of numeric and alphanumeric data types Cabin is mix
between alpha and numeric

## Exploring Missing Data.

Checking for Missing values in each feature

```{r Missing Values}
colSums(is.na(train))
```

```{r Missing values plot}
missing_values <- train %>% summarize_all(funs(sum(is.na(.))/n()))

missing_values <- gather(missing_values, key="feature", value="missing_pct")

missing_values

missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
  geom_bar(stat="identity",fill="red") +
  coord_flip() + # to flip the graph
  xlab("Features") +
  ylab("Percentage of missing values")+
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum()
```

### Missing data, it is normal?

It is quite normal to see missing data in any data-set as such data is
collected by manually which means that there might be some error.
Missing data present various problems. First, the absence of data
reduces statistical power, which refers to the probability that the test
will reject the null hypothesis when it is false. Second, the lost data
can cause bias in the estimation of parameters. Third, it can reduce the
representativeness of the samples.

### The solution to the missing data

#### Removal

The removal of the observations that contain missing might cause a
bigger issue where it might an even bigger loss of information which
cause bias in our estimation. But on close observation of our data we
can identify that there are some features that are not extremely useful
and contain missing data, so we can drop such features.

```{r droping cabin}
train <- train %>%
  select(!Cabin)
summary(train)
```

#### Imputation

Imputing the missing data gives the advantage that we use a learning
model to predict such missing data and maintain the distribution of our
data.

## Histogram of Age feature.

```{r Age histogram}
gAgeDensity <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  annotate(
    "text",
    x = 70,
    y = 0.04,
    label = paste("Skewness:",skewness(train$Age,na.rm = T)),
    colour = inferno(1,begin = 0.1),
    size = 4
  ) + 
  theme_ipsum_rc()

gAgeDensity
```

### Population mean and standard deviation of Age feature Before imputation

```{r Summary of Age feature}
train %>%
  summarise(Age_mean = mean(Age,na.rm = T), Age_sd = sd(Age,na.rm = T))
```

## Histogram of Fare feature

```{r Histogram of Fare Feature}
gFareDensity <- train %>%
  select(Fare) %>%
  ggplot(aes(Fare, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=viridis(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = viridis(1,begin=0)) + 
  scale_y_continuous(limits = c(0,0.05))+
  theme_ipsum_rc() + 
  annotate(
    "text",
    x = 200,
    y = 0.05,
    label = paste("Skewness",skewness(train$Fare)),
    colour = "black",
    size = 4
  )

gFareDensity 
```

### Population mean and standard deviaiton of Fare feature.

```{r Summary of Fare Feature}
train %>%
  summarise(Fare_mean = mean(Fare,na.rm = T), Fare_sd = sd(Fare,na.rm = T))
```

## Imputing the missing age feature

```{r Imputation}
#---------------MICE--------------
set.seed(129)
mice_mod <- mice(train[,!names(train) %in% c('PassengerId','Name','Ticket','Cabin','Survived')],method = 'rf')
mice_output <- complete(mice_mod)

gdistrOriginalData <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 25,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) +
  ggtitle("Distribution of original data") +
  theme_ipsum_rc()
gdistrMICEData <- mice_output %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 25,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  ggtitle("Distribution of mice output") +
  theme_ipsum_rc()

gridExtra::grid.arrange(gdistrOriginalData,gdistrMICEData,nrow = 1)

train$Age <- mice_output$Age

missing_values <- train %>% summarize_all(funs(sum(is.na(.))/n()))

missing_values <- gather(missing_values, key="feature", value="missing_pct")

missing_values

missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
  geom_bar(stat="identity",fill="red") +
  coord_flip() + # to flip the graph
  xlab("Features") +
  ylab("Percentage of missing values")+
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum()
```

### Population mean and standard deviation

```{r Summary of Age feature After Imputation}
train %>%
  summarise(Age_mean = mean(Age,na.rm = T), Age_sd = sd(Age,na.rm = T))
```

## Determining the distrubutuion of Age and Fare By inspection

Though determination of the distribution using inspection is likely not
going to be effective we are going to the KS-test in a later section

### Age

The histogram of the Age feature look very much like a normal
distribution, yet it's not a normal distribution itself.

### Fare

The histogram of the Fare feature fits the chi-square distribution.

## Plotting The Data

We include even more plots of our data to get a better understanding of
it, help us see hidden correlations and ways to facilitate our analysis

### Correlation Matrix

We are going to use correlation matrix of the numerical data to assess
the correlation, which might gives a better idea of which feature might
be important

```{r Correlation Matrix, fig.align='center'}
correlationMatrix <- train %>%
  filter(!is.na(Age)) %>%
  select(Survived, Pclass,Age,SibSp,Parch,Fare) %>%
  cor() %>%
  ggcorrplot(lab = T,
             ggtheme =theme_ipsum_rc(grid = F),
             title="Correlation Matrix",hc.order=T,
             colors =rev(viridis(3,alpha=0.7)),
             digits = 1)

correlationMatrix
```

The fare features seems to be the most correlated feature to survival of
the passengers, but it doesn't negate the importance of the other
features in the data. Which means that we will start by comparing the
each that we consider to be important against survival feature

### Class of Passenger Vs Survived

```{r PclassVSSurvived,fig.align='center'}
gPclassSurvived <- train %>%
  select(Pclass,Survived) %>%
  ggplot(aes(as_factor(Pclass),fill=as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(labels=scales::percent) +
  theme_ipsum_rc() + 
  labs(x = "Classes",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived"))

gPclassSurvived
```

From this Plot, it seems clear that people from the upper classes had higher survival rates, thought this seemed obvious from the beginning.
### Siblings and Spouses Vs Survived

```{r SibSpVSSurvived,fig.align='center'}
gSibSpSurvived <- train %>%
  select(SibSp,Survived) %>%
  ggplot(aes(as_factor(SibSp),fill=as_factor(Survived))) +
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Siblings and Spouses",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum()

gSibSpSurvived
```

This plot highlight a point that might be counter-intuitive that people that have no siblings or spouses had lower survival rates than those with at least one sibling or a spouse

### Number of children/parents Vs Survived

```{r fig.align='center'}
gParchSurvived <- train %>%
  select(Parch,Survived) %>%
  ggplot(aes(as_factor(Parch),fill=as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(label = scales::percent)+
  labs(x = "Number of parents/children",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum_rc()
gParchSurvived
```

### Gender VS Survived

```{r fig.align='center'}
gSexSurvived <- train %>%
  select(Sex,Survived) %>%
  ggplot(aes(as_factor(Sex),fill = as_factor(Survived))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(label = scales::percent) + 
  labs(x = "Sex",y = "Survival Rate")+
  scale_fill_discrete(name = "Survived", labels = c("Didn't Survive","Survived")) +
  theme_ipsum_rc()
print(gSexSurvived)
```

### Dashboard of the previous graphs

```{r}
gridExtra::grid.arrange(gPclassSurvived,
                        gSibSpSurvived,
                        gSexSurvived,
                        gParchSurvived,
                        nrow=2)
```

```{r}
train %>%
  group_by(Sex) %>%
  summarise(Age_mean = mean(Age,na.rm=TRUE),
            age_sd = sd(Age,na.rm=T),
            surival_mean = mean(Survived,na.rm =T),
            surival_sd = sd(Survived,na.rm = T))

```

## Sampling

### Random sample of size of 50 (Age)

```{r Sampling-1}
sample_50 <- sample_n(train,size = 50,replace = T) %>%
  summarise(mean = mean(Age), sd = sd(Age))
sample_50
```
### Sampling distrubution with a fixed size (50)
```{r}
numOfSamples <- seq(50,1000,50)

smplngDstrbtnRpsChng <- tibble()

for(i in numOfSamples){
  
  for(y in 1:i){
    nsample <- sample_n(train,size=50,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtnRpsChng) + 1
    smplngDstrbtnRpsChng[newRow,"reps"] <- i
    smplngDstrbtnRpsChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
  
}

gSamplingReps <- smplngDstrbtnRpsChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~reps,
    type = "histogram"
  )
gSamplingReps
```
While no. of sample size increase the variability of sampling
distribution decrease and the mean increase.

### Sampling distrubution with a fixed repetitions and different sizes

```{r sampling-distrubution-sizes}
sizes <- seq(20,260,20)
smplngDstrbtnSzChng <- tibble()
for(i in sizes){
  for(y in 1:1500){
    nsample <- sample_n(train,size=i,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtnSzChng) + 1
    smplngDstrbtnSzChng[newRow,"sizes"] <- i
    smplngDstrbtnSzChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
}
gSamplingSize <- smplngDstrbtnSzChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~sizes,
    type = "histogram"
  )
gSamplingSize
```

While no. of sample size increase the variability of sampling
distribution decrease,and The sample distribution mean will be normally
distributed as long as the sample size is more than 30.

# kernel distribution

> A kernel distribution is a nonparametric representation of the
> probability density function ($pdf$) of a random variable in any
> population

The kernel smoothing function defines the shape of the curve used to
generate the pdf Kernel distribution is Quote from histogram in other
word (smooth representation of a histogram) That the integral =1 There
is a benefit of smooth representation of a histogram like Ignores
irregularities and outliers , more efficient in approximation so it
deals better with large data than small data

$$ \hat{f_h} = \frac{1}{n} = \sum^n_{i = 1} K(x-x_i)  = \frac{1}{nh} K\left(\frac{x-x_i}{h}\right) $$

## Rules

### Non-weighted Data

$$ \hat{f_h} = \frac{1}{n} \sum^n_{i = 1} K(x-x_i)  = \frac{1}{nh} K\left(\frac{x-x_i}{h}\right) $$

### Weighted Data

$$
\hat{f_h} = \frac{1}{h} \sum^N_{i=1} w_i K \left(\frac{x-x_i}{h}\right), \qquad \text{where} \sum^N_{i= 1} w_i  = 1
$$

## Kernel Function

1.  Box
2.  Triangle
3.  Normal
4.  Pantechnicon

Each density curve uses the same input data, but applies a different
kernel smoothing function to generate the pdf. The density estimates are
roughly comparable, but the shape of each curve varies slightly. For
example, the box kernel produces a density curve that is less smooth
than the others.

The choice of bandwidth value controls the smoothness of the resulting
probability density curve (higher value of $h$ more smoothing)

Specifying a smaller bandwidth produces a very rough curve, but reveals
that there might be two major peaks in the data. Specifying a larger
bandwidth produces a curve nearly identical to the kernel function
Choosing the optimal ($h$) bandwidth methods :

1.  Silverman's rule of thump that computes an optimal h by assuming
    that data is normally distributed
2.  Improved Sheather Jones (ISJ) an algorithm is more robust with
    multimodality data or a lot of data (one disadvantage is it needs to
    large data )

Bounded domains data: have a constrains like data couldn't be negative (
-ve lead to probability = 0)

## Mirror method

1.  Mirror the data
2.  Sum the original and mirrored kernel density estimate
3.  Chop it so that zero at the boundary side

## 2D

$h$ : could be matrix (different $h$ in different directions)

The choice of norm comes into $d \ge 2$

The p-norm is $||x||_p := (\sum_{i=1} |x|^p)^{\frac{1}{p}}$ - norm-p =1
Manhattan distance - norm-p =2 Euclidean norm - norm-p =inf maximum norm
(it's not obvious in every case which norm is the correct one)

standard euclidean distance is good choice because it invariant under
rotation as large data choice of k and p isn't important so

# Answering our Questions

## Acurrate summary of our data.

Q3) i)

ii)?? iii) Yes, by interval estimation iv)?? Q4) Q5)

```{r}
gAgeDensity <- train %>%
  select(Age) %>%
  ggplot(aes(Age, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=inferno(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = inferno(1,begin=0)) + 
  annotate(
    "text",
    x = 70,
    y = 0.04,
    label = paste("Skewness:",skewness(train$Age,na.rm = T)),
    colour = inferno(1,begin = 0.1),
    size = 4
  ) + 
  theme_ipsum_rc()

gAgeDensity
```

Q6)//?

```{r}
gFareDensity <- train %>%
  select(Fare) %>%
  ggplot(aes(Fare, y = ..density..)) +
  geom_histogram(bins = 20,binwidth = 1,color=viridis(1,alpha=1)) + 
  geom_density(fill=inferno(1,begin = 0.5,alpha = 0.5),color = viridis(1,begin=0)) + 
  scale_y_continuous(limits = c(0,0.05))+
  theme_ipsum_rc() + 
  annotate(
    "text",
    x = 200,
    y = 0.05,
    label = paste("Skewness",skewness(train$Fare)),
    colour = "black",
    size = 4
  )

gFareDensity
```

Q8) Q)9,10,11

```{r}
numOfSamples <- c(50,100,1000)
smplngDstrbtonRpsChng <- tibble()
for(i in numOfSamples){
  for(y in 1:i){
    nsample <- sample_n(train,size=50,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtonRpsChng) + 1
    smplngDstrbtonRpsChng[newRow,"reps"] <- i
    smplngDstrbtonRpsChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
}

gSamplingReps <- smplngDstrbtonRpsChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~reps,
    type = "histogram"
  )

gSamplingReps
```

Q12)  Q)13,14,15

```{r eval=FALSE, include=FALSE}
sizes <- seq(20,200,20)
smplngDstrbtonSzChng <- tibble()
for(i in sizes){
  for(y in 1:1500){
    nsample <- sample_n(train,size=i,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtonSzChng) + 1
    smplngDstrbtonSzChng[newRow,"sizes"] <- i
    smplngDstrbtonSzChng[newRow,"x_bar"] <- mean(nsample$Age,na.rm = T)
  }
}
gSamplingSize <- smplngDstrbtonSzChng %>%
  plot_ly(
    x = ~x_bar,
    frame = ~sizes,
    type = "histogram"
  )

gSamplingSize
```

Q16) Q17,18)

```{r eval=FALSE, include=FALSE}
sizes <- seq(2,50,2)
smplngDstrbtonSzChng <- tibble()
for(i in sizes){
  for(y in 1:1500){
    nsample <- sample_n(train,size=i,replace=T) %>%
      select(Age)
    newRow <- nrow(smplngDstrbtonSzChng) + 1
    smplngDstrbtonSzChng[newRow,"sizes"] <- i
    smplngDstrbtonSzChng[newRow,"variance"] <- sd(nsample$Age,na.rm = T)**2
  }
}
gSamplingSize <- smplngDstrbtonSzChng %>%
  plot_ly(
    x = ~variance,
    frame = ~sizes,
    type = "histogram"
  )

gSamplingSize
```
